multi goroutine deal taskUnschedulable #3921

Merged

Conversation

lishangyuzi

When scheduling large-scale jobs, I ran into a problem: when a job fails to be scheduled, the PodCondition of every pod in that job is updated. Each update requires a call to the apiserver, so this step takes a long time. Could we handle this part of the logic with multiple goroutines?

@volcano-sh-bot
Contributor

Welcome @lishangyuzi!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 24, 2024
@lishangyuzi
Author

/assign @lowang-bh

@lowang-bh
Member

Have you increased the QPS of the kubeclient in the Volcano scheduler?

@lishangyuzi
Author

lishangyuzi commented Dec 25, 2024

Have you increased the QPS of the kubeclient in the Volcano scheduler?

The default kubeclient QPS already meets my expectations. It takes approximately 200 seconds for a job with 5000 pods to complete this stage.

fs.Float32Var(&s.KubeClientOptions.QPS, "kube-api-qps", defaultQPS, "QPS to use while talking with kubernetes apiserver")
fs.IntVar(&s.KubeClientOptions.Burst, "kube-api-burst", defaultBurst, "Burst to use while talking with kubernetes apiserver")

defaultQPS = 2000.0
defaultBurst = 2000

The parameters related to my API server QPS are as follows:

--max-mutating-requests-inflight=4000
--max-requests-inflight=2000
--watch-cache-sizes=node#2000,pod#10000
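
A back-of-the-envelope check of the numbers above, offered as a sketch rather than a measurement: 5000 updates in about 200 seconds is roughly 25 updates per second, or 40 ms per update on average, which is far below the configured client QPS of 2000. That would be consistent with the updates being issued one at a time and bounded by round-trip latency rather than by rate limiting, although the thread does not confirm this. The helper names below are hypothetical, and the 16-worker estimate assumes an even split with unchanged per-update latency.

```go
package main

import (
	"fmt"
	"time"
)

// perUpdateLatency returns the average time per update, assuming the
// updates ran strictly one after another (an assumption, not a measurement).
func perUpdateLatency(total time.Duration, pods int) time.Duration {
	return total / time.Duration(pods)
}

// idealParallelTime estimates the wall time when `workers` goroutines share
// the updates evenly and each update still takes `perUpdate`. Real gains
// would depend on apiserver load and client rate limits.
func idealParallelTime(perUpdate time.Duration, pods, workers int) time.Duration {
	rounds := (pods + workers - 1) / workers // ceiling division
	return time.Duration(rounds) * perUpdate
}

func main() {
	per := perUpdateLatency(200*time.Second, 5000)
	fmt.Println("average per update:", per)
	fmt.Println("effective updates per second:", float64(time.Second)/float64(per))
	fmt.Println("idealized time with 16 workers:", idealParallelTime(per, 5000, 16))
}
```

If the per-update latency really is the bottleneck, the estimate suggests the stage could shrink from minutes to seconds, which is the motivation for the concurrent approach discussed in this PR.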

"reason", reason, "message", msg)
}
wg.Add(1)
semaphore <- struct{}{}
Member

Maybe it would be better to use a goroutine pool to update those tasks concurrently.

Author

Okay, I've changed this to use a goroutine pool.
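
For readers following the thread, here is a minimal sketch of what a fixed goroutine pool looks like, as opposed to the semaphore pattern in the snippet above (`wg.Add(1)` plus `semaphore <- struct{}{}`), which starts one goroutine per task and only limits how many run at once. This is not the code from the PR: `updateAll` and the `update` callback are hypothetical names standing in for the per-pod condition update.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// updateAll runs update once for every task using a fixed number of worker
// goroutines. workers must be at least 1, otherwise the send below blocks.
func updateAll(tasks []string, workers int, update func(task string)) {
	queue := make(chan string)
	var wg sync.WaitGroup

	// Start the pool: each worker drains the queue until it is closed.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range queue {
				update(task)
			}
		}()
	}

	// Feed the tasks, then signal that no more work is coming.
	for _, task := range tasks {
		queue <- task
	}
	close(queue)
	wg.Wait()
}

func main() {
	tasks := make([]string, 0, 100)
	for i := 0; i < 100; i++ {
		tasks = append(tasks, fmt.Sprintf("pod-%d", i))
	}

	var updated int64
	updateAll(tasks, 16, func(string) { atomic.AddInt64(&updated, 1) })
	fmt.Println("updated:", atomic.LoadInt64(&updated))
}
```

With a pool, the number of goroutines stays constant no matter how many pods the job has, which matters for jobs with thousands of tasks.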

@lowang-bh
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 29, 2024
@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch from 5297a55 to bd8de08 Compare December 30, 2024 06:22
@volcano-sh-bot volcano-sh-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 30, 2024
@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch 3 times, most recently from 4e32d71 to 8156c0b Compare December 30, 2024 07:41
@Monokaix
Member

One concern: will this cause conflict errors?

@volcano-sh volcano-sh deleted a comment from volcano-sh-bot Dec 30, 2024
@volcano-sh volcano-sh deleted a comment from volcano-sh-bot Dec 30, 2024
@Monokaix
Member

/area performance

@volcano-sh-bot volcano-sh-bot added the area/performance Issues or PRs related to performance label Dec 30, 2024
@Monokaix
Member

cc @JesseStutler

@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch from 8156c0b to b6313fd Compare December 30, 2024 08:01
-		for _, taskInfo := range job.TaskStatusIndex[status] {
+		statusTasks := job.TaskStatusIndex[status]
+		workerNum := 16
+		taskInfos := make([]*schedulingapi.TaskInfo, 0)
Member

Please set the slice's capacity to the length of statusTasks.

Author

done
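
To illustrate the capacity suggestion, here is a small sketch. It assumes `statusTasks` is a map whose values are copied into a slice so that each task can later be addressed by index. The `TaskInfo` type and the `flatten` helper are hypothetical, trimmed-down stand-ins and not the Volcano types.

```go
package main

import "fmt"

// TaskInfo is a hypothetical stand-in for schedulingapi.TaskInfo.
type TaskInfo struct {
	Name string
}

// flatten copies the map values into a slice. Passing len(statusTasks) as
// the capacity lets append fill the slice without reallocating as it grows.
func flatten(statusTasks map[string]*TaskInfo) []*TaskInfo {
	taskInfos := make([]*TaskInfo, 0, len(statusTasks))
	for _, task := range statusTasks {
		taskInfos = append(taskInfos, task)
	}
	return taskInfos
}

func main() {
	statusTasks := map[string]*TaskInfo{
		"uid-1": {Name: "pod-1"},
		"uid-2": {Name: "pod-2"},
		"uid-3": {Name: "pod-3"},
	}
	taskInfos := flatten(statusTasks)
	fmt.Println("len:", len(taskInfos), "cap:", cap(taskInfos))
}
```

With `make([]*TaskInfo, 0)`, the backing array would be reallocated several times while 5000 entries are appended. Setting the capacity up front avoids those copies.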

@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch 2 times, most recently from 5395b76 to 7f14a89 Compare January 3, 2025 01:43
@JesseStutler
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Jan 3, 2025
@lishangyuzi
Author

I merged branch 'master' into my branch, so the lgtm label has been removed. @JesseStutler

@JesseStutler
Member

I merged branch 'master' into my branch, so the lgtm label has been removed. @JesseStutler

Please squash your commits into one :)

@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch 2 times, most recently from e35e570 to 084b4bd Compare January 3, 2025 06:27
@lishangyuzi
Author

I merged branch 'master' into my branch, so the lgtm label has been removed. @JesseStutler

Please squash your commits into one :)

done @JesseStutler

@@ -1486,22 +1486,32 @@ func (sc *SchedulerCache) RecordJobStatusEvent(job *schedulingapi.JobInfo, updat
 	}
 	// Update podCondition for tasks Allocated and Pending before job discarded
 	for _, status := range []schedulingapi.TaskStatus{schedulingapi.Allocated, schedulingapi.Pending, schedulingapi.Pipelined} {
-		for _, taskInfo := range job.TaskStatusIndex[status] {
+		statusTasks := job.TaskStatusIndex[status]
+		workerNum := 16
Member

We can define a const like jobUpdaterWorker = 16.

Author

@lishangyuzi lishangyuzi Jan 3, 2025

Where the workqueue.ParallelizeUntil function is used, a const is defined for the worker count in some places but not in others.

Member

I mean define a new const and reference it here :)

Author

done
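
As a rough illustration of the shape discussed in this thread, the sketch below pairs a named constant with an index-based parallel helper. `parallelize` is a simplified, stdlib-only stand-in written for this example. It mimics the calling convention of `workqueue.ParallelizeUntil` (context, worker count, number of pieces, a function taking the piece index) but is not the client-go implementation. `markUnschedulable` is also hypothetical, and the constant uses the name and value suggested in the review.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// jobUpdaterWorker is the worker count, named as suggested in the review.
const jobUpdaterWorker = 16

// parallelize calls doWorkPiece for indexes 0..pieces-1 using at most
// `workers` goroutines and stops handing out pieces once ctx is cancelled.
func parallelize(ctx context.Context, workers, pieces int, doWorkPiece func(piece int)) {
	if pieces == 0 {
		return
	}
	if workers > pieces {
		workers = pieces
	}

	// Queue every index up front so workers can pull without coordination.
	indexes := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		indexes <- i
	}
	close(indexes)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range indexes {
				select {
				case <-ctx.Done():
					return
				default:
					doWorkPiece(i)
				}
			}
		}()
	}
	wg.Wait()
}

// markUnschedulable stands in for the per-task condition update.
func markUnschedulable(names []string) []string {
	results := make([]string, len(names))
	parallelize(context.Background(), jobUpdaterWorker, len(names), func(i int) {
		// Each piece writes only its own index, so no lock is needed.
		results[i] = names[i] + ": Unschedulable"
	})
	return results
}

func main() {
	fmt.Println(markUnschedulable([]string{"pod-0", "pod-1", "pod-2", "pod-3"}))
}
```

Addressing work by index is why the tasks need to live in a slice rather than a map, and naming the worker count keeps the tuning knob in one place.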

@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch from 084b4bd to 69641a5 Compare January 7, 2025 07:44
@lishangyuzi lishangyuzi force-pushed the optimize-task-unschedulable branch from a02189f to e24db9d Compare January 7, 2025 07:49
@Monokaix
Member

/approve

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2025
@lowang-bh
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2025
@volcano-sh-bot volcano-sh-bot merged commit 37cb882 into volcano-sh:master Jan 17, 2025
16 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/performance Issues or PRs related to performance lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants